fix(ocap-kernel): enforce one delivery per crank, fix rollback cache staleness by rekmarks · Pull Request #879 · MetaMask/ocap-kernel

rekmarks · 2026-03-17T23:58:55Z

As it turns out, we have been violating the invariant that a crank consists of the delivery of a single message or notification. Since at least the introduction of KernelQueue.ts in #484, one iteration of the kernel's run queue—which should be equivalent to a crank—has actually been able to deliver an unbounded number of messages.

This means that, if a delivery aborts mid-crank, rollbackCrank('start') reverts all deliveries in the crank (including earlier successful ones), creating inconsistency with vat in-memory state and leaving promise subscriptions permanently dangling.

This PR ensures that we correctly implement cranks via the kernel's run queue loop as described below.
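To illustrate the failure mode described above, here is a minimal sketch built on a hypothetical in-memory `ToyStore` (not the real kernel store API). It assumes that an aborted delivery rolls the store back to the start of the crank and that the aborted item is then dropped. With many deliveries per crank, the rollback also reverts the earlier successful delivery of `a`, so the "vat" sees `a` twice; with one delivery per crank, each item is attempted once.

```typescript
// Hypothetical in-memory stand-in for the kernel store; names are illustrative only.
class ToyStore {
  queue: string[];
  delivered: string[] = [];
  private saved: { queue: string[]; delivered: string[] } | undefined;

  constructor(items: string[]) {
    this.queue = [...items];
  }

  startCrank(): void {
    // Snapshot the persistent state at the start of the crank.
    this.saved = { queue: [...this.queue], delivered: [...this.delivered] };
  }

  rollbackCrank(): void {
    // Revert everything recorded since startCrank().
    if (this.saved !== undefined) {
      this.queue = [...this.saved.queue];
      this.delivered = [...this.saved.delivered];
    }
  }

  endCrank(): void {
    this.saved = undefined;
  }
}

// Runs the queue to completion and returns every delivery attempt the "vat" saw.
// Delivery of "b" always aborts in this toy scenario.
function traceDeliveries(onePerCrank: boolean): string[] {
  const store = new ToyStore(["a", "b", "c"]);
  const attempts: string[] = [];
  while (store.queue.length > 0) {
    store.startCrank();
    do {
      const item = store.queue.shift() as string;
      attempts.push(item);
      if (item === "b") {
        // Assumed abort handling: roll back to the crank start, then drop the item.
        store.rollbackCrank();
        store.queue.splice(store.queue.indexOf(item), 1);
        break;
      }
      store.delivered.push(item);
    } while (!onePerCrank && store.queue.length > 0);
    store.endCrank();
  }
  return attempts;
}

console.log(traceDeliveries(false).join(", ")); // a, b, a, c
console.log(traceDeliveries(true).join(", ")); // a, b, c
```

The in-memory `attempts` array plays the role of vat state that the store rollback cannot undo, which is where the inconsistency comes from.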

Summary

Enforce one run-queue item per crank (change while to if in KernelQueue generator) and fix stale StoredQueue caches after rollbackCrank by refreshing the run queue and invalidating runQueueLengthCache
Reject JS promise subscriptions when a crank aborts with vat termination; fix terminateVat callback in Kernel to avoid deadlock by bypassing VatManager.terminateVat() (which calls waitForCrank())
Simplify the run queue implementation; in lieu of an async generator + loop, use a single loop with helper functions
Improve error messages for splat cases (revoked, no owner, no object, endpoint gone) and handle vanished endpoints in KernelRouter delivery
Fix SubclusterManager to catch rejected bootstrap promises
Add orphaned ephemeral exo tests (unit + e2e)
Glossary formatting and crank definition correction

Test plan

Existing unit tests updated and passing (KernelQueue.test.ts, KernelRouter.test.ts, crank.test.ts, syscall-validation.test.ts, vat-lifecycle.test.ts)
New unit test for orphaned ephemeral exos (orphaned-ephemeral-exo.test.ts)
New e2e test for orphaned ephemeral exos (orphaned-ephemeral-exo.test.ts in kernel-node-runtime)

🤖 Generated with Claude Code

Note

High Risk
High risk because it changes core KernelQueue/KernelRouter crank semantics, rollback behavior, and how message failures propagate (resolve vs reject), which can affect delivery ordering, retries, and many callers/tests.

Overview
Kernel crank semantics are tightened and error propagation is made consistent. KernelQueue.run is rewritten to process exactly one run-queue item per crank, and JS-side subscriptions created by enqueueMessage now support both resolve and reject so rejected kernel promises reject the returned promise.

Rollback and termination handling are hardened. rollbackCrank now refreshes the stored run-queue and invalidates length caches to avoid stale in-memory state after DB rollback, and abort+terminate paths immediately reject the aborted send’s subscription. Kernel vat termination during a crank bypasses terminateVat() to avoid deadlock.
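The stale-cache problem can be sketched as follows. This is a toy model, assuming a queue that keeps its head and tail pointers both in a database and in memory; `ToyDb` and `CachedQueue` are hypothetical names, not the real `StoredQueue` implementation. After a rollback the database is restored but the in-memory pointers are not, so the queue wrongly reports itself empty until it is refreshed.

```typescript
// Hypothetical key-value "database" with whole-state savepoints.
class ToyDb {
  data: Record<string, string | undefined> = {};

  savepoint(): Record<string, string | undefined> {
    return { ...this.data };
  }

  rollback(saved: Record<string, string | undefined>): void {
    this.data = { ...saved };
  }
}

// Hypothetical queue that caches its head/tail pointers in memory.
class CachedQueue {
  private head = 0;
  private tail = 0;

  constructor(private db: ToyDb) {
    this.refresh();
  }

  // Re-read the cached pointers from the database.
  refresh(): void {
    this.head = Number(this.db.data["head"] ?? "0");
    this.tail = Number(this.db.data["tail"] ?? "0");
  }

  enqueue(item: string): void {
    this.db.data[`item.${this.tail}`] = item;
    this.tail += 1;
    this.db.data["tail"] = String(this.tail);
  }

  dequeue(): string | undefined {
    if (this.head === this.tail) {
      return undefined;
    }
    const key = `item.${this.head}`;
    const item = this.db.data[key];
    delete this.db.data[key];
    this.head += 1;
    this.db.data["head"] = String(this.head);
    return item;
  }
}

function dequeueAfterRollback(refreshAfterRollback: boolean): string | undefined {
  const db = new ToyDb();
  const queue = new CachedQueue(db);
  queue.enqueue("m1");
  const saved = db.savepoint();
  queue.dequeue(); // the crank consumes m1
  db.rollback(saved); // the crank aborts: m1 is back in the database
  if (refreshAfterRollback) {
    queue.refresh();
  }
  return queue.dequeue();
}

console.log(dequeueAfterRollback(false)); // undefined (stale cached head)
console.log(dequeueAfterRollback(true)); // m1
```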

Message “splat” cases are clearer and better handled. KernelRouter improves errors for revoked/no-owner/no-object/endpoint-gone cases, resolves splat rejections using the current promise decider, and treats vanished endpoints as a splat with promise rejection.

Tests/docs updated and expanded. Many tests are updated to expect promise rejections (including remote comms, revocation, lifecycle), new unit+e2e coverage is added for orphaned ephemeral exos across vat restart, kernel-utils exports a new isCapData guard used to rethrow bootstrap errors as real Errors, and the glossary is expanded/clarified (kernel promises/decider/crank definition).
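As a rough illustration of the guard mentioned above, the sketch below assumes CapData has the shape `{ body: string, slots: [...] }`. The names `isCapDataLike` and `toError` are hypothetical and are not the actual `kernel-utils` exports; the real `isCapData` may check more or differently.

```typescript
// Assumed CapData shape: a serialized body plus a list of slots.
type CapDataLike = { body: string; slots: unknown[] };

// Hypothetical type guard; not the real kernel-utils implementation.
function isCapDataLike(value: unknown): value is CapDataLike {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { body?: unknown }).body === "string" &&
    Array.isArray((value as { slots?: unknown }).slots)
  );
}

// Turns an arbitrary rejection reason into a real Error so callers can rethrow it.
function toError(reason: unknown): Error {
  if (reason instanceof Error) {
    return reason;
  }
  if (isCapDataLike(reason)) {
    return new Error(`bootstrap failed: ${reason.body}`);
  }
  return new Error(String(reason));
}

console.log(toError({ body: "#boom", slots: [] }).message); // bootstrap failed: #boom
```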

Written by Cursor Bugbot for commit 233587c. This will update automatically on new commits.

…staleness

- Restructure run queue generator to yield exactly one item per startCrank/endCrank pair, preventing rollback from undoing unrelated earlier deliveries in the same crank
- Refresh StoredQueue after rollback so cached head/tail pointers are re-read from DB, fixing dequeue returning undefined
- Invalidate runQueueLengthCache after rollback
- Bypass VatManager.terminateVat() in KernelQueue callback to avoid waitForCrank() deadlock when terminating from within a crank
- Handle vanished endpoints in KernelRouter.deliverSend with try/catch, treating as splat instead of crashing
- Change KernelQueue subscriptions to {resolve, reject} so aborted sends can reject the caller's JS promise immediately
- Distinguish rejected vs fulfilled in invokeKernelSubscription
- Improve splat error messages to describe cause without leaking internal identifiers (krefs, endpoint IDs)
- Add integration test for orphaned ephemeral exo rejection
- Standardize KernelQueue test loop-exit pattern using sentinel

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
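The `{resolve, reject}` subscription change can be sketched like this. It is a minimal model under stated assumptions: `ToySubscriptions`, `subscribe`, `settle`, and the `kp1` identifier are hypothetical and do not mirror the real `KernelQueue` or `invokeKernelSubscription` signatures. The point is only that keeping both callbacks lets an aborted send reject the caller's JS promise instead of leaving it pending forever.

```typescript
type Subscription = {
  resolve: (value: unknown) => void;
  reject: (reason: unknown) => void;
};

// Hypothetical registry of JS-side subscriptions keyed by kernel promise ID.
class ToySubscriptions {
  private subs: { [kpid: string]: Subscription | undefined } = {};

  // Returns a JS promise that settles when the kernel promise does.
  subscribe(kpid: string): Promise<unknown> {
    return new Promise((resolve, reject) => {
      this.subs[kpid] = { resolve, reject };
    });
  }

  // Settles and removes the subscription; returns false if none was registered.
  settle(kpid: string, state: "fulfilled" | "rejected", value: unknown): boolean {
    const sub = this.subs[kpid];
    if (sub === undefined) {
      return false;
    }
    delete this.subs[kpid];
    if (state === "fulfilled") {
      sub.resolve(value);
    } else {
      sub.reject(value);
    }
    return true;
  }

  pendingCount(): number {
    return Object.keys(this.subs).length;
  }
}

const demoSubs = new ToySubscriptions();
demoSubs.subscribe("kp1").catch((reason) => console.log("rejected:", reason));
// An aborted send can now reject its caller right away.
demoSubs.settle("kp1", "rejected", "vat terminated");
```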

github-actions · 2026-03-18T03:29:40Z

Coverage Report

Status	Category	Percentage	Covered / Total
🔵	Lines	77.36% ⬇️ -0.05%	7870 / 10173
🔵	Statements	77.17% ⬇️ -0.05%	7995 / 10360
🔵	Functions	75.24% ⬇️ -0.11%	1891 / 2513
🔵	Branches	75.02% ⬆️ +0.07%	3232 / 4308

File Coverage

File	Stmts	Branches	Functions	Lines	Uncovered Lines
Changed Files
packages/kernel-test/src/vats/orphaned-ephemeral-consumer.ts	0%	100%	0%	0%	14-20
packages/kernel-test/src/vats/orphaned-ephemeral-provider.ts	0%	100%	0%	0%	11-19
packages/kernel-ui/src/components/SendMessageForm.tsx	100% 🟰 ±0%	72.72% ⬇️ -2.28%	100% 🟰 ±0%	100% 🟰 ±0%
packages/kernel-utils/src/index.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/kernel-utils/src/types.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/ocap-kernel/src/Kernel.ts	88.18% ⬆️ +1.03%	77.77% 🟰 ±0%	82.6% ⬆️ +2.17%	88.18% ⬆️ +1.03%	286-289, 306, 330, 398-408, 500, 568, 634-637, 650, 660-661, 704, 721
packages/ocap-kernel/src/KernelQueue.ts	98.23% ⬆️ +0.10%	90.62% ⬆️ +1.95%	100% 🟰 ±0%	98.23% ⬆️ +0.10%	90, 351
packages/ocap-kernel/src/KernelRouter.ts	84.44% ⬇️ -5.72%	73.13% ⬇️ -2.25%	100% 🟰 ±0%	84.44% ⬇️ -5.72%	110, 169, 183, 235-258, 264, 291-300, 307, 353, 368, 371
packages/ocap-kernel/src/types.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/ocap-kernel/src/remotes/kernel/RemoteHandle.ts	88.5% 🟰 ±0%	82.56% 🟰 ±0%	87.5% 🟰 ±0%	88.77% 🟰 ±0%	347, 366-409, 462, 505, 515-517, 558-571, 910, 983, 1029
packages/ocap-kernel/src/store/index.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/ocap-kernel/src/store/types.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/ocap-kernel/src/store/methods/crank.ts	100% 🟰 ±0%	93.75% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/ocap-kernel/src/vats/SubclusterManager.ts	95.07% ⬇️ -1.30%	88.88% ⬇️ -2.92%	100% 🟰 ±0%	95% ⬇️ -1.32%	194-197, 251, 334, 339-341, 357, 361
packages/ocap-kernel/src/vats/VatHandle.ts	90% ⬆️ +4.29%	85.71% ⬆️ +3.57%	100% 🟰 ±0%	90% ⬆️ +4.29%	305, 356-361, 367-373

Generated in workflow #3971 for commit 233587c by the Vitest Coverage Report Action

FUDCo

I think I see what was going wrong here, and if my analysis is right, this PR does not address it at all. The issue is that the crank transaction is being driven inside the runQueueItems generator function, which is only responsible for pulling the next item off the run queue. The actual delivery happens in run, which iterates over the stream produced by runQueueItems, but the latter commits the transaction before the delivery has even happened. Somehow the refactoring that moved the run queue processing loop out of Kernel.ts and into KernelQueue.ts mangled this. I don't understand why run is in KernelQueue.ts at all.

FUDCo · 2026-03-18T21:50:19Z

packages/ocap-kernel/src/KernelQueue.ts

+        // Queue empty — sleep until woken
+        const { promise, resolve } = makePromiseKit<void>();
+        if (this.#wakeUpTheRunQueue !== null) {
+          Fail`run queue already waiting to be woken; cannot sleep again before the previous wake handler is consumed`;


Even though it would be both technically wrong and wildly inappropriate, I somehow feel the urge to change the error message to "Can't sleep. Clowns will eat me."
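For readers unfamiliar with the sleep/wake pattern in the snippet above, here is a rough standalone sketch. `makeKit` is a local stand-in for `makePromiseKit`, and `ToyWaker` with its `sleep`/`wakeUp` methods is hypothetical; the real `KernelQueue` fields and error handling differ.

```typescript
// Local stand-in for a promise kit: a promise plus its resolver.
function makeKit(): { promise: Promise<void>; resolve: () => void } {
  let resolve!: () => void;
  const promise = new Promise<void>((res) => {
    resolve = res;
  });
  return { promise, resolve };
}

// Hypothetical holder for a single pending wake-up handler.
class ToyWaker {
  private wake: (() => void) | null = null;

  // Called when the queue is empty; the returned promise settles on wakeUp().
  sleep(): Promise<void> {
    if (this.wake !== null) {
      throw new Error("already waiting to be woken");
    }
    const { promise, resolve } = makeKit();
    this.wake = resolve;
    return promise;
  }

  // Called when new work is enqueued; returns false if nothing was sleeping.
  wakeUp(): boolean {
    if (this.wake === null) {
      return false;
    }
    const wake = this.wake;
    this.wake = null; // consume the handler so a later sleep() is allowed
    wake();
    return true;
  }
}
```

Clearing the handler before calling it is what makes a second `sleep()` legal once the first wake-up has been consumed.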

packages/ocap-kernel/src/KernelQueue.ts

Simplifies the implementation of the kernel's run loop in a purely behavioral refactor. The previous async generator + loop iteration has been unwrapped into a single loop with multiple helper functions. I noticed that the startCrank() call is the only part of the run loop that can throw an uncaught exception, and made a note to investigate that later. An unrelated TODO comment is also added to the kernel router.

rekmarks · 2026-03-19T04:19:00Z

After further consideration, @FUDCo and I concluded that the run queue implementation was correct as of 46b674d, but the loop + async generator was difficult to reason about. b56cffc attempts to address this by moving to a single loop with helper functions.
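Part of what makes the loop + generator shape hard to reason about is that code after a `yield` does not run until the consumer asks for the next item. The sketch below uses a synchronous generator for brevity (the same suspension rule applies to async generators consumed with `for await`), and it assumes the generator brackets its `yield` with crank start and end calls; `crankItems` and `runToyQueue` are hypothetical names, not the removed `#runQueueItems` code.

```typescript
// Yields one item per simulated crank, logging the crank boundaries.
function* crankItems(queue: string[], log: string[]): Generator<string, void, unknown> {
  while (queue.length > 0) {
    log.push("startCrank");
    try {
      yield queue.shift() as string; // suspends here until the consumer calls next()
    } finally {
      log.push("endCrank"); // runs only after the consumer's loop body has finished
    }
  }
}

// Consumes the generator, "delivering" each item in the loop body.
function runToyQueue(items: string[]): string[] {
  const log: string[] = [];
  for (const item of crankItems([...items], log)) {
    log.push(`deliver ${item}`);
  }
  return log;
}

console.log(runToyQueue(["a", "b"]).join(" | "));
// startCrank | deliver a | endCrank | startCrank | deliver b | endCrank
```

So although the crank bookkeeping lives in the generator and the delivery lives in the consumer, each delivery still happens between its own start and end; the single-loop form simply makes that ordering visible in one place.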

rekmarks · 2026-03-19T04:21:05Z

@cursor review

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-03-19T04:25:47Z

packages/ocap-kernel/src/KernelQueue.ts

-   *
-   * @yields the next item in the run queue.
-   */
-  async *#runQueueItems(): AsyncGenerator<RunQueueItem> {


Crank lifecycle wraps empty queue check unnecessarily

Medium Severity

startCrank() and createCrankSavepoint('start') are called before checking whether the queue has an item. When the queue is empty, a phantom crank is started, a DB savepoint is created, and then endCrank() immediately releases it — all without any delivery. This violates the PR's stated invariant that a crank consists of exactly one delivery. The startCrank/createCrankSavepoint calls belong inside the else branch (after confirming a queue item exists), not before the emptiness check. As @FUDCo noted: "It's ending the crank before the delivery even happens."

Additional Locations (1)

packages/ocap-kernel/src/KernelQueue.ts#L83-L84

Unfortunately #getNextRunQueueItem() mutates the kernel store if there is an item on the run queue, so we have to start the crank and create the save point before calling it.

packages/ocap-kernel/src/vats/SubclusterManager.ts

rekmarks · 2026-03-19T05:14:14Z

packages/ocap-kernel/src/KernelQueue.ts

-      this.#kernelStore.startCrank();
      let wakeUpPromise: Promise<void> | undefined;
+
+      this.#kernelStore.startCrank();


startCrank() and endCrank() (down in the finally block) are the only parts of the run loop that can throw uncaught errors. This behavior was pre-existing. Is it what we want?

rekmarks · 2026-03-19T19:19:01Z

packages/ocap-kernel/src/KernelQueue.ts

+    if (this.#kernelStore.runQueueLength() > 0) {
+      const item = this.#kernelStore.dequeueRun();
+      if (item) {
+        return item;
+      }
+    }
+    return undefined;


ATTN: in the previous implementation, GC actions, reap actions, and run queue items were processed in this loop:

All GC actions

All reap actions

All run queue items

Now, they are processed in this loop:

All GC actions

All reap actions

One run queue item

Everything obviously appears to work, but we should also convince ourselves that it's correct.

Add "kernel promise" entry distinguishing kernel promises from JS promises, and "decider" entry with function call analogy. Update existing entries to specify "kernel promise" where applicable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

FUDCo

Based on reviewing the recent commits and our conversation earlier today, I think this is OK to go. I do have one pedantic quibble about a glossary entry. Feel free to leave it as is and take care of it later, or if you want to fix it now I promise quick turnaround on a rubber stamp.

FUDCo · 2026-03-20T01:21:49Z

docs/glossary.md

-runtime environment for vat code and handles object persistence, promise management, and
-[syscall](#syscall) coordination.
+runtime environment for vat code and handles object persistence, [kernel
+promise](#kernel-promise) management, and [syscall](#syscall) coordination.


Liveslots doesn't actually manage kernel promises at all; the kernel does (that's why they're called kernel promises). The only promises that get exposed outside the vat by liveslots do, in fact, turn into kernel promises, but liveslots doesn't know anything about this.

FUDCo

G2G

rekmarks and others added 4 commits March 17, 2026 12:41

test(kernel-node): Add failing orphaned ephemeral exo test

a855062

chore: Format glossary.md

fe98131

docs: Correct glossary definition of "crank"

95f2884

rekmarks requested a review from FUDCo March 17, 2026 23:59

rekmarks marked this pull request as draft March 18, 2026 00:06

cursor bot reviewed Mar 18, 2026

rekmarks added 4 commits March 17, 2026 17:25

fix(evm-wallet-experiment): expect rejection instead of error CapData in peer-wallet tests

ec9d7b0

fix(kernel-ui): display queueMessage errors in SendMessageForm response area

7fe3acc

fix(kernel-test): expect rejection in endowments test for bad-host fetch

5113b6e

fix(kernel-node-runtime): expect rejections in remote-comms e2e tests

46b674d

FUDCo reviewed Mar 18, 2026

rekmarks marked this pull request as ready for review March 19, 2026 04:16

rekmarks requested a review from a team as a code owner March 19, 2026 04:16

cursor bot reviewed Mar 19, 2026

rekmarks added 3 commits March 18, 2026 21:35

refactor(ocap-kernel): CrankResults -> CrankResult

377a6f6

chore: Remove erroneous comment

69908a7

feat(kernel-utils): add isCapData type guard, use in SubclusterManager

6df2026

rekmarks commented Mar 19, 2026

rekmarks added 2 commits March 18, 2026 22:18

docs: Fix #getNextRunQueueItem docstring

1299823

docs: "crank results" -> "crank result"

735b586

rekmarks requested a review from FUDCo March 19, 2026 05:40

refactor: Improve KernelQueue.run() readability

79f437e

rekmarks commented Mar 19, 2026

FUDCo previously approved these changes Mar 20, 2026

docs: Tweak liveslots glossary entry

233587c

rekmarks dismissed FUDCo’s stale review via 233587c March 20, 2026 01:54

rekmarks requested a review from FUDCo March 20, 2026 01:54

rekmarks enabled auto-merge March 20, 2026 01:54

FUDCo approved these changes Mar 20, 2026

rekmarks added this pull request to the merge queue Mar 20, 2026

Merged via the queue into main with commit c1464ed Mar 20, 2026
30 checks passed

rekmarks deleted the rekm/orphaned-exos branch March 20, 2026 02:13
